A formal model for dataflows, runs of dataflows, and provenance within runs
Authors
Abstract
Modern scientific research is characterized by extensive computerized processing of laboratory results and other scientific data. Such processes are often complex, consisting of several data-manipulation steps. We refer to such processes as dataflows, to distinguish them from more general workflows. General workflows also emphasize the control-flow aspect of a process, whereas our focus is mainly on data manipulation and data management. Important data management aspects of scientific dataflows include, among others:
Similar resources
A Formal Model of Dataflow Repositories
Dataflow repositories are databases containing dataflows and their different runs. We propose a formal conceptual data model for such repositories. Our model includes careful formalisations of such features as complex data manipulation, external service calls, subdataflows, and the provenance of output values.
Análise de Estratégias de Acesso a Grandes Volumes de Dados
The efficient processing of big data has become an issue for several areas. In science, researchers have used dataflows to express computational analysis and experiments on data. An important feature of scientific dataflows is that the analysis must scan a large set of data. In this sense, this work investigates alternatives for storing large volumes of data favoring the execution of dataflows ...
Multi-Agent Information Processing and Adaptive Control in Global Telecommunication and Computer Networks (A. V. Timofeev)
The problems and methods of adaptive control and multi-agent information processing in global telecommunication and computer networks (TCN) are discussed. Criteria for the controllability and communication ability (routing ability) of dataflows are described. A multi-agent model for the exchange of divided information resources in global TCN is suggested. Peculiarities for adaptive and intellig...
Optimizing ETL Dataflow Using Shared Caching and Parallelization Methods
Extract-Transform-Load (ETL) handles large amounts of data and manages workload through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. To minimize the time and resources required by ETL dataflows, this paper presents a framework to optimize dataflows using shared caching and parallelization techniques. The framew...
SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows
Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for dataflo...